Hayden Hoopes
This analysis explores whether deep learning models can predict the emotion expressed in a spoken phrase. The goal is to build a model that assigns each audio file to one of eight emotions: neutral, calm, happy, sad, angry, fearful, disgust, or surprised. To do so, I will create several models, each of which I expect to be more accurate than the last. The models that I plan to create are as follows:
A baseline that assigns a single class to every recording
A decision tree trained on the flattened spectrograms
An artificial neural network (ANN) trained on the flattened spectrograms
A convolutional neural network (CNN) trained on the spectrograms
A CNN trained on augmented spectrograms
A model built on the pre-trained OpenL3 model
The RAVDESS data set contains 1440 files: 60 trials per actor x 24 actors = 1440. The RAVDESS contains 24 professional actors (12 female, 12 male), vocalizing two lexically-matched statements in a neutral North American accent. Speech emotions include calm, happy, sad, angry, fearful, surprise, and disgust expressions. Each expression is produced at two levels of emotional intensity (normal, strong), with an additional neutral expression.
Each of the 1440 files has a unique filename. The filename consists of a 7-part numerical identifier (e.g., 03-01-06-01-02-01-12.wav). These identifiers define the stimulus characteristics:
Modality (01 = full-AV, 02 = video-only, 03 = audio-only).
Vocal channel (01 = speech, 02 = song).
Emotion (01 = neutral, 02 = calm, 03 = happy, 04 = sad, 05 = angry, 06 = fearful, 07 = disgust, 08 = surprised).
Emotional intensity (01 = normal, 02 = strong). NOTE: There is no strong intensity for the 'neutral' emotion.
Statement (01 = "Kids are talking by the door", 02 = "Dogs are sitting by the door").
Repetition (01 = 1st repetition, 02 = 2nd repetition).
Actor (01 to 24. Odd numbered actors are male, even numbered actors are female).
Filename example: 03-01-06-01-02-01-12.wav
Audio-only (03) Speech (01) Fearful (06) Normal intensity (01) Statement "dogs" (02) 1st Repetition (01) 12th Actor (12) Female, as the actor ID number is even.
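The naming scheme above can be decoded mechanically. The following is a minimal sketch, assuming the seven fields always appear in the order listed; `parse_ravdess_filename` and the lookup tables are hypothetical helpers introduced here for illustration and are not part of the original notebook.

```python
import os

# Lookup tables copied from the identifier description above.
MODALITY = {1: "full-AV", 2: "video-only", 3: "audio-only"}
VOCAL_CHANNEL = {1: "speech", 2: "song"}
EMOTION = {1: "neutral", 2: "calm", 3: "happy", 4: "sad",
           5: "angry", 6: "fearful", 7: "disgust", 8: "surprised"}
INTENSITY = {1: "normal", 2: "strong"}
STATEMENT = {1: "Kids are talking by the door",
             2: "Dogs are sitting by the door"}


def parse_ravdess_filename(path):
    """Decode a RAVDESS filename (hypothetical helper, for illustration only)."""
    stem = os.path.splitext(os.path.basename(path))[0]
    parts = [int(p) for p in stem.split("-")]
    if len(parts) != 7:
        raise ValueError(f"expected 7 fields, got {len(parts)}: {path}")
    modality, channel, emotion, intensity, statement, repetition, actor = parts
    return {
        "modality": MODALITY[modality],
        "vocal_channel": VOCAL_CHANNEL[channel],
        "emotion": EMOTION[emotion],
        "intensity": INTENSITY[intensity],
        "statement": STATEMENT[statement],
        "repetition": repetition,
        "actor": actor,
        # Odd-numbered actors are male, even-numbered actors are female.
        "actor_sex": "female" if actor % 2 == 0 else "male",
    }


info = parse_ravdess_filename("03-01-06-01-02-01-12.wav")
print(info["emotion"], info["intensity"], info["actor"], info["actor_sex"])
```

Only the third field (the emotion) is needed as the class label in this analysis, but decoding the other fields makes it easy to filter by statement or actor.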
The first thing I need to do is get the data into the correct format so that I can process the audio files in a neural network environment. I'll place each of the files and their associated labels into lists that I can then use to create a TensorFlow Dataset object.
import os
import warnings
warnings.filterwarnings("ignore")
# Get all of the file paths into an array and the class labels (emotions) into an array of the same size
data_path = 'audio_speech_actors_01-24/'
file_paths = []
labels = []
label_dict = {
    1: 'neutral',
    2: 'calm',
    3: 'happy',
    4: 'sad',
    5: 'angry',
    6: 'fearful',
    7: 'disgust',
    8: 'surprised'
}
for actor in os.listdir(data_path):
    class_path = os.path.join(data_path, actor)
    for file in os.listdir(class_path):
        labels.append(int(file.split('-')[2])) # This extracts the class label from the file name and appends it to the labels list
        file_paths.append(os.path.join(class_path, file)) # This adds the file path to the file_paths list
import numpy as np
from sklearn.model_selection import train_test_split
import librosa
train_paths, val_paths, train_labels, val_labels = train_test_split(file_paths, labels, test_size=0.2, random_state=1)
train_labels = np.array(train_labels)
val_labels = np.array(val_labels)
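def load_and_process_audio(file_path, label):
    # librosa.load resamples to 22,050 Hz by default, so 75000 samples is about 3.4 seconds
    audio, sample_rate = librosa.load(file_path)
    audio = audio[:75000] # Cut off audio with more than 75000 samples
    audio = np.concatenate((np.zeros(75000-len(audio)), audio), axis=0) # Left-pad shorter audio with zeros
    # Note: sr=44100 does not match the loaded sample rate, so the mel filterbank is built for
    # twice the true rate; the value is kept so that every spectrogram is computed the same way
    spectrogram = np.sqrt(librosa.feature.melspectrogram(y=audio, sr=44100))
    return spectrogram, label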
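The fixed length of 75,000 samples determines the spectrogram size that the later models rely on. The arithmetic below is a small sketch, assuming librosa's default settings for `melspectrogram` (128 mel bands, a hop length of 512 samples, and centered frames); the helper name `n_frames` is introduced here for illustration.

```python
N_SAMPLES = 75000   # fixed audio length used in load_and_process_audio
HOP_LENGTH = 512    # assumed librosa default
N_MELS = 128        # assumed librosa default
SAMPLE_RATE = 22050 # default rate that librosa.load resamples to


def n_frames(n_samples, hop_length=HOP_LENGTH):
    # With centered frames, the number of frames is 1 + n_samples // hop_length.
    return 1 + n_samples // hop_length


frames = n_frames(N_SAMPLES)
spectrogram_shape = (N_MELS, frames)
flattened_size = N_MELS * frames
duration_seconds = N_SAMPLES / SAMPLE_RATE

print(spectrogram_shape, flattened_size, round(duration_seconds, 2))
```

Under these assumptions each spectrogram is 128 x 147, which flattens to 18,816 values. These are the same numbers that appear as the input shapes of the dense and convolutional models later on.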
Now that the function for loading the audio files as spectrograms is complete, I can load the audio files into TensorFlow Dataset objects.
import tensorflow as tf
train_data = []
for file, label in zip(train_paths, train_labels):
    spectrogram, label = load_and_process_audio(file, label)
    train_data.append(spectrogram)
val_data = []
for file, label in zip(val_paths, val_labels):
    spectrogram, label = load_and_process_audio(file, label)
    val_data.append(spectrogram)
# Despite the "sparse" prefix, these are one-hot label matrices: column i is 1 where the label equals i+1
sparse_train_labels = np.zeros((train_labels.shape[0],8))
for i, value in enumerate(np.arange(1,9)):
    sparse_train_labels[:, i] = (train_labels == value).astype(int)
sparse_val_labels = np.zeros((val_labels.shape[0],8))
for i, value in enumerate(np.arange(1,9)):
    sparse_val_labels[:, i] = (val_labels == value).astype(int)
batch_size = 32
train_dataset = tf.data.Dataset.from_tensor_slices((train_data, sparse_train_labels))
train_dataset = train_dataset.shuffle(buffer_size=len(train_paths)).batch(batch_size)
val_dataset = tf.data.Dataset.from_tensor_slices((val_data, sparse_val_labels))
val_dataset = val_dataset.batch(batch_size)
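The label matrices built above are one-hot encodings of the labels 1 to 8. As a sketch of an equivalent, more compact construction, the same matrix can be produced by indexing an identity matrix; the function name `one_hot` is hypothetical and is not used elsewhere in the notebook.

```python
import numpy as np


def one_hot(labels, num_classes=8):
    """One-hot encode 1-based integer labels (illustrative helper)."""
    labels = np.asarray(labels)
    # Row k of the identity matrix is the one-hot vector for class k+1.
    return np.eye(num_classes, dtype=np.float32)[labels - 1]


example = one_hot([1, 3, 8])
print(example)
```

This matters for the loss function: `categorical_crossentropy` expects one-hot targets like these, whereas `sparse_categorical_crossentropy` would take the integer labels directly.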
Now that we have the datasets created, let's visualize some of the different audio samples. To make the emotions easier to compare, we will only plot recordings of the phrase "Kids are talking by the door".
import matplotlib.pyplot as plt
# Keep only recordings of statement 01; os.path.basename works with either path separator
phrase_1_audio_files = [(path, label) for path, label in zip(file_paths, labels) if os.path.basename(path).split('-')[4] == '01']
phrase_1_audio_file_paths = [i[0] for i in phrase_1_audio_files]
phrase_1_labels = [i[1] for i in phrase_1_audio_files]
# Record the position of the first recording found for each emotion
new_class_positions = []
new_classes = []
for i, label in enumerate(phrase_1_labels):
    if label not in new_classes:
        new_class_positions.append(i)
        new_classes.append(label)
fig, axs = plt.subplots(2, 4, figsize=(12, 6))
for i in range(8):
    spectrogram, label = load_and_process_audio(phrase_1_audio_file_paths[new_class_positions[i]], phrase_1_labels[new_class_positions[i]])
    axs[i//4, i%4].imshow(tf.transpose(spectrogram))
    axs[i//4, i%4].set_title(f'Spectrogram: \'{label_dict[label]}\'')
plt.show()
Next, I'll compute the accuracy of a baseline model that does no learning at all. I can do this using the labels that I extracted earlier. Assigning any single class other than neutral to every observation yields an accuracy of 13.3%, because each of those seven classes makes up 13.3% of the data (neutral makes up 6.7%). This is the baseline that the other models need to beat.
Surprisingly, after flattening the spectrograms, a simple decision tree with no tuning classified the recordings in the validation set with about 30.6% accuracy. This is more than double the 13.3% baseline, but I think that better models can be created to beat this accuracy.
from collections import Counter
import pandas as pd
pd.Series(Counter(labels)) / len(labels)
1    0.066667
2    0.133333
3    0.133333
4    0.133333
5    0.133333
6    0.133333
7    0.133333
8    0.133333
dtype: float64
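The class proportions above lead to slightly different baselines depending on how the guessing is done. The sketch below compares three naive strategies; the class counts are an assumption derived from the data set description (24 actors x 2 statements x 2 repetitions, with two intensities for every emotion except neutral), and the variable names are introduced here for illustration.

```python
from collections import Counter

# Assumed class counts: 96 neutral recordings and 192 for each other emotion.
counts = Counter({1: 96, 2: 192, 3: 192, 4: 192, 5: 192, 6: 192, 7: 192, 8: 192})
total = sum(counts.values())

# Strategy 1: always predict one of the most frequent classes.
majority_accuracy = max(counts.values()) / total

# Strategy 2: guess each of the eight classes with equal probability.
uniform_accuracy = sum((n / total) * (1 / len(counts)) for n in counts.values())

# Strategy 3: guess each class in proportion to its frequency.
proportional_accuracy = sum((n / total) ** 2 for n in counts.values())

print(total, round(majority_accuracy, 4), round(uniform_accuracy, 4),
      round(proportional_accuracy, 4))
```

Under these assumptions the three strategies score about 13.3%, 12.5%, and 12.9% respectively, so 13.3% is the strongest of the naive baselines.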
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report, accuracy_score
flattened_train_data = np.array([s.flatten() for s in train_data])
flattened_val_data = np.array([s.flatten() for s in val_data])
dt = DecisionTreeClassifier()
dt.fit(flattened_train_data, train_labels)
predictions = dt.predict(flattened_val_data)
print(classification_report(val_labels, predictions))
print(accuracy_score(val_labels, predictions))
              precision    recall  f1-score   support

           1       0.23      0.35      0.28        17
           2       0.48      0.50      0.49        32
           3       0.17      0.16      0.16        44
           4       0.17      0.17      0.17        41
           5       0.37      0.56      0.44        27
           6       0.31      0.29      0.30        45
           7       0.41      0.28      0.33        46
           8       0.34      0.31      0.32        36

    accuracy                           0.31       288
   macro avg       0.31      0.33      0.31       288
weighted avg       0.31      0.31      0.30       288

0.3055555555555556
Next, I'll use an artificial neural network (ANN) to try to increase the accuracy of this classification model slightly. While I don't think that the neural network will perform significantly better than the baseline model, I do expect it to spot nonlinear patterns in the data set that could give it additional classification power.
As evidenced below, the simple neural network has 2,419,176 parameters and reaches 45% accuracy on the validation set. This is better than the decision tree, but the model still likely isn't seeing localized patterns in the data, because the two-dimensional structure of the spectrogram is lost when it is flattened into a one-dimensional array.
From the graphs and the training log, the validation loss reaches its minimum (1.71) at epoch 5 and then climbs steadily while the training loss keeps falling, which indicates that the model starts overfitting almost immediately. Validation accuracy levels off between roughly 43% and 50% from that point on. Because the checkpoint keeps the model with the lowest validation loss, the saved model is the one from epoch 5.
According to the classification report, the model does best on class 2 (calm, F1 of 0.63) and worst on class 1 (neutral, F1 of 0.22) and class 4 (sad, F1 of 0.21). Perhaps convolutional neural networks will help improve the predictions for these and all other classes.
from tensorflow import keras
from tensorflow.keras import layers
model = keras.Sequential([
    layers.Dense(units=128, activation='relu', input_shape=(18816,)), # 18816 = 128 mel bands x 147 frames
    layers.Dense(units=64, activation='relu'),
    layers.Dense(units=32, activation='relu'),
    layers.Dense(units=8, activation='softmax')
])
model.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['accuracy'])
model.summary()
Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #
=================================================================
 dense (Dense)               (None, 128)               2408576
 dense_1 (Dense)             (None, 64)                8256
 dense_2 (Dense)             (None, 32)                2080
 dense_3 (Dense)             (None, 8)                 264
=================================================================
Total params: 2419176 (9.23 MB)
Trainable params: 2419176 (9.23 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
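The parameter counts in the summary can be checked by hand: a dense layer has one weight per input-output pair plus one bias per output. The following sketch reproduces the totals; `dense_params` is a hypothetical helper introduced only for this check.

```python
def dense_params(n_in, n_out):
    # n_in * n_out weights plus n_out biases.
    return n_in * n_out + n_out


# Input size followed by the width of each dense layer in the model above.
layer_sizes = [18816, 128, 64, 32, 8]
per_layer = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
total_params = sum(per_layer)

print(per_layer, total_params)
```

Almost all of the parameters sit in the first layer, which connects every one of the 18,816 flattened spectrogram values to each of the 128 units.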
callbacks = [keras.callbacks.ModelCheckpoint('ann.keras', save_best_only=True)]
history = model.fit(flattened_train_data, sparse_train_labels, validation_data=(flattened_val_data, sparse_val_labels), epochs=25, batch_size=32, callbacks=callbacks)
Epoch 1/25 36/36 [==============================] - 1s 13ms/step - loss: 1.9496 - accuracy: 0.1970 - val_loss: 1.8686 - val_accuracy: 0.2292
Epoch 2/25 36/36 [==============================] - 0s 9ms/step - loss: 1.6300 - accuracy: 0.3950 - val_loss: 1.7147 - val_accuracy: 0.3785
Epoch 3/25 36/36 [==============================] - 0s 8ms/step - loss: 1.3318 - accuracy: 0.5295 - val_loss: 1.8072 - val_accuracy: 0.3785
Epoch 4/25 36/36 [==============================] - 0s 8ms/step - loss: 1.0776 - accuracy: 0.6328 - val_loss: 1.7492 - val_accuracy: 0.3958
Epoch 5/25 36/36 [==============================] - 0s 9ms/step - loss: 0.8656 - accuracy: 0.7179 - val_loss: 1.7125 - val_accuracy: 0.4549
Epoch 6/25 36/36 [==============================] - 0s 7ms/step - loss: 0.6912 - accuracy: 0.7977 - val_loss: 1.9694 - val_accuracy: 0.4757
Epoch 7/25 36/36 [==============================] - 0s 8ms/step - loss: 0.5592 - accuracy: 0.8429 - val_loss: 2.2183 - val_accuracy: 0.4618
Epoch 8/25 36/36 [==============================] - 0s 7ms/step - loss: 0.4474 - accuracy: 0.8776 - val_loss: 2.4538 - val_accuracy: 0.4722
Epoch 9/25 36/36 [==============================] - 0s 7ms/step - loss: 0.3720 - accuracy: 0.8993 - val_loss: 2.5328 - val_accuracy: 0.4514
Epoch 10/25 36/36 [==============================] - 0s 8ms/step - loss: 0.2732 - accuracy: 0.9332 - val_loss: 2.4485 - val_accuracy: 0.5035
Epoch 11/25 36/36 [==============================] - 0s 7ms/step - loss: 0.2249 - accuracy: 0.9410 - val_loss: 2.9475 - val_accuracy: 0.4826
Epoch 12/25 36/36 [==============================] - 0s 8ms/step - loss: 0.2047 - accuracy: 0.9618 - val_loss: 2.8294 - val_accuracy: 0.4514
Epoch 13/25 36/36 [==============================] - 0s 7ms/step - loss: 0.2024 - accuracy: 0.9575 - val_loss: 2.9124 - val_accuracy: 0.4757
Epoch 14/25 36/36 [==============================] - 0s 7ms/step - loss: 0.1161 - accuracy: 0.9783 - val_loss: 3.1302 - val_accuracy: 0.4965
Epoch 15/25 36/36 [==============================] - 0s 7ms/step - loss: 0.1635 - accuracy: 0.9714 - val_loss: 3.0397 - val_accuracy: 0.4931
Epoch 16/25 36/36 [==============================] - 0s 7ms/step - loss: 0.0842 - accuracy: 0.9844 - val_loss: 3.2675 - val_accuracy: 0.4583
Epoch 17/25 36/36 [==============================] - 0s 7ms/step - loss: 0.0602 - accuracy: 0.9939 - val_loss: 3.4523 - val_accuracy: 0.4722
Epoch 18/25 36/36 [==============================] - 0s 8ms/step - loss: 0.0806 - accuracy: 0.9905 - val_loss: 3.5126 - val_accuracy: 0.5000
Epoch 19/25 36/36 [==============================] - 0s 7ms/step - loss: 0.0304 - accuracy: 0.9965 - val_loss: 4.1639 - val_accuracy: 0.4965
Epoch 20/25 36/36 [==============================] - 0s 7ms/step - loss: 0.0544 - accuracy: 0.9878 - val_loss: 4.2558 - val_accuracy: 0.4306
Epoch 21/25 36/36 [==============================] - 0s 7ms/step - loss: 0.0131 - accuracy: 1.0000 - val_loss: 4.2340 - val_accuracy: 0.4444
Epoch 22/25 36/36 [==============================] - 0s 7ms/step - loss: 0.1419 - accuracy: 0.9783 - val_loss: 4.7363 - val_accuracy: 0.4444
Epoch 23/25 36/36 [==============================] - 0s 7ms/step - loss: 0.0533 - accuracy: 0.9913 - val_loss: 4.3147 - val_accuracy: 0.4618
Epoch 24/25 36/36 [==============================] - 0s 8ms/step - loss: 0.0083 - accuracy: 1.0000 - val_loss: 4.5180 - val_accuracy: 0.4583
Epoch 25/25 36/36 [==============================] - 0s 8ms/step - loss: 0.0933 - accuracy: 0.9852 - val_loss: 3.9724 - val_accuracy: 0.4861
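`ModelCheckpoint` with `save_best_only=True` monitors `val_loss` by default, so the saved model is the one from the epoch with the lowest validation loss, not the highest validation accuracy. The sketch below replays that selection rule on the values from the training log; it only illustrates the rule and does not call Keras.

```python
# Validation metrics copied from the 25-epoch training log above.
val_loss = [1.8686, 1.7147, 1.8072, 1.7492, 1.7125, 1.9694, 2.2183, 2.4538,
            2.5328, 2.4485, 2.9475, 2.8294, 2.9124, 3.1302, 3.0397, 3.2675,
            3.4523, 3.5126, 4.1639, 4.2558, 4.2340, 4.7363, 4.3147, 4.5180,
            3.9724]
val_accuracy = [0.2292, 0.3785, 0.3785, 0.3958, 0.4549, 0.4757, 0.4618, 0.4722,
                0.4514, 0.5035, 0.4826, 0.4514, 0.4757, 0.4965, 0.4931, 0.4583,
                0.4722, 0.5000, 0.4965, 0.4306, 0.4444, 0.4444, 0.4618, 0.4583,
                0.4861]

# Epoch numbers are 1-based, as in the log.
best_loss_epoch = min(range(len(val_loss)), key=val_loss.__getitem__) + 1
best_acc_epoch = max(range(len(val_accuracy)), key=val_accuracy.__getitem__) + 1

print(best_loss_epoch, val_accuracy[best_loss_epoch - 1])
print(best_acc_epoch, val_accuracy[best_acc_epoch - 1])
```

The checkpointed model comes from epoch 5 (validation accuracy of about 45%), while the highest validation accuracy of the run occurs at epoch 10. Passing `monitor='val_accuracy'` to the callback would keep that epoch instead; which of the two is preferable is a judgment call.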
pd.DataFrame(history.history)[['loss', 'val_loss']].plot()
plt.title('Loss By Epoch')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.show()
pd.DataFrame(history.history)[['accuracy', 'val_accuracy']].plot()
plt.title('Accuracy By Epoch')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.show()
test_model = keras.models.load_model('ann.keras')
predicted = test_model.predict(flattened_val_data)
predicted = np.argmax(predicted, axis=1)+1
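# classification_report expects the true labels first, then the predictions.
# The report printed below came from a run with the two arguments reversed, so its precision
# and recall columns are exchanged and its support column counts predictions per class.
print(classification_report(val_labels, predicted))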
9/9 [==============================] - 0s 3ms/step
              precision    recall  f1-score   support

           1       0.18      0.30      0.22        10
           2       0.94      0.48      0.63        63
           3       0.50      0.35      0.42        62
           4       0.17      0.28      0.21        25
           5       0.70      0.49      0.58        39
           6       0.27      0.60      0.37        20
           7       0.50      0.66      0.57        35
           8       0.42      0.44      0.43        34

    accuracy                           0.45       288
   macro avg       0.46      0.45      0.43       288
weighted avg       0.56      0.45      0.48       288
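scikit-learn's `classification_report` takes the true labels as its first argument and the predictions as its second. The following pure-Python sketch, using made-up labels and a hypothetical helper named `precision_recall`, shows why the order matters: reversing the arguments exchanges precision and recall, while accuracy and F1 stay the same.

```python
def precision_recall(y_true, y_pred, cls):
    """Precision and recall for one class (illustrative helper)."""
    true_positives = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
    n_predicted = sum(1 for p in y_pred if p == cls)
    n_actual = sum(1 for t in y_true if t == cls)
    precision = true_positives / n_predicted if n_predicted else 0.0
    recall = true_positives / n_actual if n_actual else 0.0
    return precision, recall


# Made-up labels for illustration only.
y_true = [1, 1, 1, 2, 2, 2, 2, 2]
y_pred = [1, 2, 2, 2, 2, 2, 2, 1]

correct_order = precision_recall(y_true, y_pred, cls=2)
reversed_order = precision_recall(y_pred, y_true, cls=2)

print(correct_order)
print(reversed_order)
```

In this toy example class 2 has a precision of 4/6 and a recall of 4/5 when the arguments are in the right order, and the two values trade places when they are reversed.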
In this step, I will build a convolutional neural network (CNN) that treats the spectrograms of the audio files as images and learns features from them. Hopefully, this CNN will perform better than both the decision tree model (31% accuracy) and the artificial neural network model (45% accuracy).
In the end, the CNN started overfitting after only a few epochs, as seen in the graph of the validation loss: it reaches its minimum (1.57) at epoch 6 and then skyrockets. Although the validation accuracy keeps increasing in later epochs, peaking at 57% in epochs 23 and 24, the rising validation loss seems to indicate that the model is overfitting to features that aren't really there, creating a model that does not generalize well to new data. Because the checkpoint keeps the model with the lowest validation loss, the saved CNN is the one from epoch 6, which produced an accuracy score of just 42%.
I am actually quite surprised that a regular artificial neural network outperformed the convolutional neural network. This could have happened by random chance, or it could be that the spectrograms simply don't provide enough information (especially regarding localized patterns) to classify emotions better than a simple array of values.
inputs = keras.Input(shape=(128, 147, 1), name='Input')
x = layers.Conv2D(filters=32, kernel_size=3, activation='relu', name='convolution_layer_1')(inputs)
x = layers.MaxPooling2D(pool_size=2, name='pooling_1')(x)
x = layers.Conv2D(filters=64, kernel_size=3, activation='relu', name='convolution_layer_2')(x)
x = layers.MaxPooling2D(pool_size=2, name='pooling_2')(x)
x = layers.Conv2D(filters=128, kernel_size=3, activation='relu', name='convolution_layer_3')(x)
x = layers.MaxPooling2D(pool_size=2, name='pooling_3')(x)
x = layers.Conv2D(filters=256, kernel_size=3, activation='relu', name='convolution_layer_4')(x)
x = layers.MaxPooling2D(pool_size=2, name='pooling_4')(x)
x = layers.Conv2D(filters=256, kernel_size=3, activation='relu', name='convolution_layer_5')(x)
x = layers.Flatten()(x)
outputs = layers.Dense(8, activation='softmax', name='output')(x)
model = keras.Model(inputs=inputs, outputs=outputs, name='base_cnn')
model.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['accuracy'])
model.summary()
Model: "base_cnn"
_________________________________________________________________
 Layer (type)                    Output Shape              Param #
=================================================================
 Input (InputLayer)              [(None, 128, 147, 1)]     0
 convolution_layer_1 (Conv2D)    (None, 126, 145, 32)      320
 pooling_1 (MaxPooling2D)        (None, 63, 72, 32)        0
 convolution_layer_2 (Conv2D)    (None, 61, 70, 64)        18496
 pooling_2 (MaxPooling2D)        (None, 30, 35, 64)        0
 convolution_layer_3 (Conv2D)    (None, 28, 33, 128)       73856
 pooling_3 (MaxPooling2D)        (None, 14, 16, 128)       0
 convolution_layer_4 (Conv2D)    (None, 12, 14, 256)       295168
 pooling_4 (MaxPooling2D)        (None, 6, 7, 256)         0
 convolution_layer_5 (Conv2D)    (None, 4, 5, 256)         590080
 flatten (Flatten)               (None, 5120)              0
 output (Dense)                  (None, 8)                 40968
=================================================================
Total params: 1018888 (3.89 MB)
Trainable params: 1018888 (3.89 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
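The output shapes and parameter counts in this summary follow from two rules: an unpadded 3 x 3 convolution shrinks each spatial dimension by 2, and 2 x 2 max pooling halves it (rounding down). The sketch below walks the 128 x 147 input through the five convolution blocks; the helper names are hypothetical and exist only for this check.

```python
def conv_valid(size, kernel=3):
    # Unpadded ("valid") convolution with stride 1.
    return size - kernel + 1


def pool(size, pool_size=2):
    # Max pooling with stride equal to the pool size; a leftover row or column is dropped.
    return size // pool_size


def conv_params(c_in, c_out, kernel=3):
    # kernel * kernel * c_in weights plus one bias, per output filter.
    return (kernel * kernel * c_in + 1) * c_out


h, w = 128, 147
channels = [1, 32, 64, 128, 256, 256]
shapes = []
total_params = 0
for block in range(5):
    h, w = conv_valid(h), conv_valid(w)
    shapes.append((h, w))
    total_params += conv_params(channels[block], channels[block + 1])
    if block < 4:  # the last convolution is followed by Flatten, not pooling
        h, w = pool(h), pool(w)
        shapes.append((h, w))

flattened = h * w * channels[-1]
total_params += flattened * 8 + 8  # final dense layer with 8 outputs

print(shapes)
print(flattened, total_params)
```

The final feature map is only 4 x 5, so the dense output layer sees 5,120 values and the whole network has 1,018,888 parameters, fewer than half as many as the dense network.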
callbacks = [keras.callbacks.ModelCheckpoint('cnn.keras', save_best_only=True, monitor='val_loss')]
history = model.fit(train_dataset, validation_data=val_dataset, epochs=25, batch_size=32, callbacks=callbacks)
Epoch 1/25 36/36 [==============================] - 4s 104ms/step - loss: 1.8681 - accuracy: 0.2552 - val_loss: 1.9079 - val_accuracy: 0.2882
Epoch 2/25 36/36 [==============================] - 4s 100ms/step - loss: 1.7166 - accuracy: 0.3568 - val_loss: 1.8403 - val_accuracy: 0.3194
Epoch 3/25 36/36 [==============================] - 4s 100ms/step - loss: 1.6283 - accuracy: 0.3594 - val_loss: 1.6883 - val_accuracy: 0.3611
Epoch 4/25 36/36 [==============================] - 4s 102ms/step - loss: 1.5379 - accuracy: 0.4089 - val_loss: 1.7005 - val_accuracy: 0.3472
Epoch 5/25 36/36 [==============================] - 4s 102ms/step - loss: 1.4286 - accuracy: 0.4540 - val_loss: 1.8021 - val_accuracy: 0.3438
Epoch 6/25 36/36 [==============================] - 4s 102ms/step - loss: 1.3505 - accuracy: 0.5061 - val_loss: 1.5675 - val_accuracy: 0.4167
Epoch 7/25 36/36 [==============================] - 4s 99ms/step - loss: 1.2124 - accuracy: 0.5590 - val_loss: 1.7132 - val_accuracy: 0.4236
Epoch 8/25 36/36 [==============================] - 4s 106ms/step - loss: 1.0518 - accuracy: 0.6128 - val_loss: 1.8883 - val_accuracy: 0.4271
Epoch 9/25 36/36 [==============================] - 4s 105ms/step - loss: 0.9699 - accuracy: 0.6458 - val_loss: 1.6499 - val_accuracy: 0.4757
Epoch 10/25 36/36 [==============================] - 4s 101ms/step - loss: 0.8401 - accuracy: 0.6944 - val_loss: 2.3165 - val_accuracy: 0.4583
Epoch 11/25 36/36 [==============================] - 4s 102ms/step - loss: 0.7120 - accuracy: 0.7587 - val_loss: 3.0223 - val_accuracy: 0.3958
Epoch 12/25 36/36 [==============================] - 4s 104ms/step - loss: 0.6233 - accuracy: 0.7899 - val_loss: 2.7621 - val_accuracy: 0.4306
Epoch 13/25 36/36 [==============================] - 4s 105ms/step - loss: 0.5644 - accuracy: 0.8212 - val_loss: 2.6264 - val_accuracy: 0.4514
Epoch 14/25 36/36 [==============================] - 4s 101ms/step - loss: 0.4248 - accuracy: 0.8507 - val_loss: 2.2267 - val_accuracy: 0.5035
Epoch 15/25 36/36 [==============================] - 4s 101ms/step - loss: 0.3680 - accuracy: 0.8689 - val_loss: 2.9568 - val_accuracy: 0.4931
Epoch 16/25 36/36 [==============================] - 4s 101ms/step - loss: 0.3448 - accuracy: 0.8967 - val_loss: 4.6338 - val_accuracy: 0.4792
Epoch 17/25 36/36 [==============================] - 4s 101ms/step - loss: 0.3116 - accuracy: 0.9167 - val_loss: 3.5512 - val_accuracy: 0.5000
Epoch 18/25 36/36 [==============================] - 4s 101ms/step - loss: 0.2173 - accuracy: 0.9401 - val_loss: 4.4431 - val_accuracy: 0.5208
Epoch 19/25 36/36 [==============================] - 4s 104ms/step - loss: 0.3077 - accuracy: 0.9436 - val_loss: 3.8495 - val_accuracy: 0.5521
Epoch 20/25 36/36 [==============================] - 4s 103ms/step - loss: 0.2302 - accuracy: 0.9280 - val_loss: 3.1742 - val_accuracy: 0.5417
Epoch 21/25 36/36 [==============================] - 4s 103ms/step - loss: 0.1583 - accuracy: 0.9696 - val_loss: 4.1780 - val_accuracy: 0.5486
Epoch 22/25 36/36 [==============================] - 4s 106ms/step - loss: 0.1756 - accuracy: 0.9557 - val_loss: 3.9359 - val_accuracy: 0.5347
Epoch 23/25 36/36 [==============================] - 4s 105ms/step - loss: 0.0883 - accuracy: 0.9800 - val_loss: 4.0909 - val_accuracy: 0.5729
Epoch 24/25 36/36 [==============================] - 4s 104ms/step - loss: 0.1403 - accuracy: 0.9661 - val_loss: 4.7601 - val_accuracy: 0.5729
Epoch 25/25 36/36 [==============================] - 4s 103ms/step - loss: 0.0966 - accuracy: 0.9766 - val_loss: 4.6087 - val_accuracy: 0.5278
pd.DataFrame(history.history)[['loss', 'val_loss']].plot()
plt.title('Loss By Epoch')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.show()
pd.DataFrame(history.history)[['accuracy', 'val_accuracy']].plot()
plt.title('Accuracy By Epoch')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.show()
test_model = keras.models.load_model('cnn.keras')
predicted = test_model.predict(val_dataset)
predicted = np.argmax(predicted, axis=1)+1
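# True labels go first. The report printed below was generated with the arguments reversed,
# so its precision and recall columns are exchanged.
print(classification_report(val_labels, predicted))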
9/9 [==============================] - 0s 30ms/step
              precision    recall  f1-score   support

           1       0.29      0.23      0.26        22
           2       0.75      0.33      0.46        72
           3       0.18      0.40      0.25        20
           4       0.15      0.26      0.19        23
           5       0.44      0.75      0.56        16
           6       0.47      0.48      0.47        44
           7       0.52      0.52      0.52        46
           8       0.56      0.44      0.49        45

    accuracy                           0.42       288
   macro avg       0.42      0.43      0.40       288
weighted avg       0.50      0.42      0.43       288
Since I was a little disappointed with the performance of the model in the previous step, I would like to try adding some augmented spectrograms to the data set to give the model more data to learn from. Hopefully, this will allow the model to predict with more confidence, minimizing the validation loss and increasing the validation accuracy.
The augmentation performed on the spectrograms includes time masking (zeroing out spans of time to introduce noise), frequency masking (zeroing out bands of frequencies to introduce noise), and pitch shifting (altering the pitch to produce variations in the data set). With luck, these augmentations will produce additional data that the model can use to learn better patterns for identifying emotions in the audio data.
In the end, the model that used augmented data reached a validation accuracy of 51%, which is much better than the previous CNN model with an accuracy of 42%. Thus, it appears that the data augmentation worked, although this model also uses a different architecture (one-dimensional convolutions with batch normalization and dropout), so the improvement cannot be attributed to the augmentation alone.
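Pitch shifting in librosa is expressed in semitones, and a shift of n semitones multiplies every frequency by 2^(n/12). The augmentation loop in this notebook passes `n_steps` values of 0, 1, 2, and 3 and feeds each shifted signal into the next shift, so the shifts accumulate. The sketch below works out the resulting totals; it is plain arithmetic and does not call librosa.

```python
def semitone_ratio(n_steps):
    # Frequency multiplier for a shift of n_steps semitones (12 per octave).
    return 2.0 ** (n_steps / 12)


steps_per_pass = [0, 1, 2, 3]  # the n_steps values produced by range(4)

cumulative_steps = []
total = 0
for n in steps_per_pass:
    total += n  # each shift is applied on top of the previous one
    cumulative_steps.append(total)

ratios = [round(semitone_ratio(s), 3) for s in cumulative_steps]

print(cumulative_steps)
print(ratios)
```

Under this reading, the four augmented copies sit 0, 1, 3, and 6 semitones above the original, so the last copy is raised by half an octave (a frequency factor of about 1.414).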
def time_masking(spectrogram, num_masks=4):
    # Zero out num_masks randomly placed windows of 15 to 49 time frames each, introducing noise (modifies the array in place)
    for i in range(num_masks):
        t = np.random.randint(15, 50)
        t0 = np.random.randint(0, spectrogram.shape[1] - t)
        spectrogram[:, t0:t0+t] = 0
    return spectrogram
def frequency_masking(spectrogram, num_masks=4):
    # The counterpart of time masking: zero out num_masks randomly placed bands of 5 to 14 mel bins so that the model does not see them
    for i in range(num_masks):
        f = np.random.randint(5, 15)
        f0 = np.random.randint(0, spectrogram.shape[0] - f)
        spectrogram[f0:f0+f, :] = 0
    return spectrogram
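The two masking functions above modify the spectrogram they are given. A variant that leaves its input untouched and takes an explicit random generator can make the augmentation reproducible; the sketch below is one possible way to write it, and `mask_spectrogram` with its parameters is a hypothetical name that does not appear in the notebook.

```python
import numpy as np


def mask_spectrogram(spectrogram, rng, num_masks=4,
                     time_width=(15, 50), freq_width=(5, 15)):
    """Return a masked copy of a (mel bands x frames) spectrogram."""
    masked = spectrogram.copy()  # leave the caller's array unchanged
    n_mels, n_frames = masked.shape
    for _ in range(num_masks):
        # Time mask: zero a random window of frames (upper bound is exclusive).
        t = rng.integers(*time_width)
        t0 = rng.integers(0, n_frames - t)
        masked[:, t0:t0 + t] = 0
        # Frequency mask: zero a random band of mel bins.
        f = rng.integers(*freq_width)
        f0 = rng.integers(0, n_mels - f)
        masked[f0:f0 + f, :] = 0
    return masked


rng = np.random.default_rng(0)  # fixed seed so the masks can be reproduced
original = np.ones((128, 147), dtype=np.float32)
masked = mask_spectrogram(original, rng)

print(int((masked == 0).sum()), "values masked out of", masked.size)
```

Seeding the generator means the same masks are drawn on every run, which makes it easier to tell whether a change in accuracy comes from the model or from the augmentation.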
def load_and_process_augmented_audio(file_path, label):
    audio, sample_rate = librosa.load(file_path)
    audio = audio[:75000] # Cut off audio with more than 75000 samples
    audio = np.concatenate((np.zeros(75000-len(audio)), audio), axis=0) # Left-pad shorter audio with zeros
    audios = [audio]
    for i in range(4): # Let's return the original audio plus four other audios that have had some transformations applied to them for each audio in the data set
        # Each shift is applied to the previously shifted audio, so the copies end up
        # 0, 1, 3, and 6 semitones above the original
        audio = librosa.effects.pitch_shift(audio, sr=44100, n_steps=i) # do some pitch shifting
        audios.append(audio)
    spectrograms = [np.sqrt(librosa.feature.melspectrogram(y=audio, sr=44100)) for audio in audios]
    for i in range(1,4): # Mask copies 1 to 3; the original (0) and the last copy (4) are left unmasked
        spectrograms[i] = time_masking(spectrograms[i])
        spectrograms[i] = frequency_masking(spectrograms[i])
    return spectrograms, [label]*5
augmented_train_data = []
augmented_train_labels = []
for file, label in zip(train_paths, train_labels):
    # Use a name other than `labels` so the full label list built earlier is not overwritten
    spectrograms, aug_labels = load_and_process_augmented_audio(file, label)
    augmented_train_data.extend(spectrograms)
    augmented_train_labels.extend(aug_labels)
augmented_train_labels = np.array(augmented_train_labels) # An array is needed for the element-wise comparison below
val_data = []
for file, label in zip(val_paths, val_labels):
    spectrogram, label = load_and_process_audio(file, label)
    val_data.append(spectrogram)
sparse_train_labels = np.zeros((len(augmented_train_labels),8))
for i, value in enumerate(np.arange(1,9)):
    sparse_train_labels[:, i] = (augmented_train_labels == value).astype(int)
sparse_val_labels = np.zeros((len(val_labels),8))
for i, value in enumerate(np.arange(1,9)):
    sparse_val_labels[:, i] = (val_labels == value).astype(int)
batch_size = 32
train_dataset = tf.data.Dataset.from_tensor_slices((augmented_train_data, sparse_train_labels))
# The five copies of each recording are adjacent, so the shuffle buffer must cover the whole augmented set
train_dataset = train_dataset.shuffle(buffer_size=len(augmented_train_data)).batch(batch_size)
val_dataset = tf.data.Dataset.from_tensor_slices((val_data, sparse_val_labels))
val_dataset = val_dataset.batch(batch_size)
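Augmentation multiplies the training set by five, which changes the number of batches per epoch. The arithmetic below is a short sketch, assuming the 80/20 split of the 1,440 files made earlier and a batch size of 32; the variable names are introduced for illustration.

```python
import math

n_files = 1440
test_size = 0.2
batch_size = 32
copies_per_file = 5  # the original plus four pitch-shifted copies

# scikit-learn rounds the validation share up and gives the rest to training.
n_val = math.ceil(n_files * test_size)
n_train = n_files - n_val

steps_plain = math.ceil(n_train / batch_size)
n_augmented = n_train * copies_per_file
steps_augmented = math.ceil(n_augmented / batch_size)

print(n_train, n_val, steps_plain, n_augmented, steps_augmented)
```

These figures line up with the training logs: 36 steps per epoch for the unaugmented models, 180 steps per epoch for the augmented one, and 288 validation recordings in every classification report.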
Let's visualize some of the augmented spectrograms. The first image on the left is the original spectrogram.
fig, axs = plt.subplots(1, 5, figsize=(16, 10))
for i in range(5):
    spectrogram = augmented_train_data[i]
    label = augmented_train_labels[i]
    axs[i].imshow(tf.transpose(spectrogram))
    axs[i].set_title(f'Spectrogram: \'{label_dict[label]}\'')
plt.show()
inputs = keras.Input(shape=(128, 147), name='Input')
x = layers.Conv1D(filters=512, kernel_size=3, activation='relu', name='convolution_layer_1')(inputs)
x = layers.BatchNormalization()(x)
x = layers.MaxPooling1D(pool_size=2, name='pooling_1')(x)
x = layers.Conv1D(filters=256, kernel_size=3, activation='relu', name='convolution_layer_2')(x)
x = layers.BatchNormalization()(x)
x = layers.MaxPooling1D(pool_size=2, name='pooling_2')(x)
x = layers.Conv1D(filters=128, kernel_size=3, activation='relu', name='convolution_layer_3')(x)
x = layers.BatchNormalization()(x)
x = layers.MaxPooling1D(pool_size=2, name='pooling_3')(x)
x = layers.Dropout(0.2)(x)
x = layers.Conv1D(filters=64, kernel_size=3, activation='relu', name='convolution_layer_4')(x)
x = layers.Flatten()(x)
x = layers.Dropout(0.2)(x)
x = layers.Dense(512, activation='relu', name='dense')(x)
x = layers.BatchNormalization()(x)
outputs = layers.Dense(8, activation='softmax', name='output')(x)
model = keras.Model(inputs=inputs, outputs=outputs, name='base_cnn')
model.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['accuracy'])
model.summary()
Model: "base_cnn"
_________________________________________________________________
 Layer (type)                                  Output Shape         Param #
=================================================================
 Input (InputLayer)                            [(None, 128, 147)]   0
 convolution_layer_1 (Conv1D)                  (None, 126, 512)     226304
 batch_normalization (BatchNormalization)      (None, 126, 512)     2048
 pooling_1 (MaxPooling1D)                      (None, 63, 512)      0
 convolution_layer_2 (Conv1D)                  (None, 61, 256)      393472
 batch_normalization_1 (BatchNormalization)    (None, 61, 256)      1024
 pooling_2 (MaxPooling1D)                      (None, 30, 256)      0
 convolution_layer_3 (Conv1D)                  (None, 28, 128)      98432
 batch_normalization_2 (BatchNormalization)    (None, 28, 128)      512
 pooling_3 (MaxPooling1D)                      (None, 14, 128)      0
 dropout (Dropout)                             (None, 14, 128)      0
 convolution_layer_4 (Conv1D)                  (None, 12, 64)       24640
 flatten_1 (Flatten)                           (None, 768)          0
 dropout_1 (Dropout)                           (None, 768)          0
 dense (Dense)                                 (None, 512)          393728
 batch_normalization_3 (BatchNormalization)    (None, 512)          2048
 output (Dense)                                (None, 8)            4104
=================================================================
Total params: 1146312 (4.37 MB)
Trainable params: 1143496 (4.36 MB)
Non-trainable params: 2816 (11.00 KB)
_________________________________________________________________
callbacks = [keras.callbacks.ModelCheckpoint('cnn_augmented.keras', save_best_only=True, monitor='val_loss')]
history = model.fit(train_dataset, validation_data=val_dataset, epochs=25, batch_size=32, callbacks=callbacks)
Epoch 1/25 180/180 [==============================] - 7s 31ms/step - loss: 2.1027 - accuracy: 0.2479 - val_loss: 2.7594 - val_accuracy: 0.0799
Epoch 2/25 180/180 [==============================] - 6s 31ms/step - loss: 1.8302 - accuracy: 0.3189 - val_loss: 2.6191 - val_accuracy: 0.1076
Epoch 3/25 180/180 [==============================] - 6s 31ms/step - loss: 1.7357 - accuracy: 0.3483 - val_loss: 2.3662 - val_accuracy: 0.2188
Epoch 4/25 180/180 [==============================] - 5s 30ms/step - loss: 1.6094 - accuracy: 0.3910 - val_loss: 2.0876 - val_accuracy: 0.4062
Epoch 5/25 180/180 [==============================] - 5s 30ms/step - loss: 1.4966 - accuracy: 0.4429 - val_loss: 1.8757 - val_accuracy: 0.4688
Epoch 6/25 180/180 [==============================] - 5s 30ms/step - loss: 1.4130 - accuracy: 0.4707 - val_loss: 1.7945 - val_accuracy: 0.4861
Epoch 7/25 180/180 [==============================] - 5s 30ms/step - loss: 1.3350 - accuracy: 0.5017 - val_loss: 1.8775 - val_accuracy: 0.4549
Epoch 8/25 180/180 [==============================] - 5s 30ms/step - loss: 1.2323 - accuracy: 0.5451 - val_loss: 2.0714 - val_accuracy: 0.4236
Epoch 9/25 180/180 [==============================] - 5s 30ms/step - loss: 1.1659 - accuracy: 0.5700 - val_loss: 1.9651 - val_accuracy: 0.4410
Epoch 10/25 180/180 [==============================] - 5s 30ms/step - loss: 1.1013 - accuracy: 0.5880 - val_loss: 2.3449 - val_accuracy: 0.4410
Epoch 11/25 180/180 [==============================] - 6s 31ms/step - loss: 1.0273 - accuracy: 0.6214 - val_loss: 1.8953 - val_accuracy: 0.5312
Epoch 12/25 180/180 [==============================] - 6s 31ms/step - loss: 0.9589 - accuracy: 0.6552 - val_loss: 1.7928 - val_accuracy: 0.5139
Epoch 13/25 180/180 [==============================] - 6s 32ms/step - loss: 0.9093 - accuracy: 0.6703 - val_loss: 2.0152 - val_accuracy: 0.4722
Epoch 14/25 180/180 [==============================] - 6s 33ms/step - loss: 0.8719 - accuracy: 0.6844 - val_loss: 1.9575 - val_accuracy: 0.4931
Epoch 15/25 180/180 [==============================] - 6s 33ms/step - loss: 0.8112 - accuracy: 0.7035 - val_loss: 2.0537 - val_accuracy: 0.5174
Epoch 16/25 180/180 [==============================] - 6s 32ms/step - loss: 0.7723 - accuracy: 0.7234 - val_loss: 1.9738 - val_accuracy: 0.4896
Epoch 17/25 180/180 [==============================] - 6s 32ms/step - loss: 0.7297 - accuracy: 0.7365 - val_loss: 2.3430 - val_accuracy: 0.4618
Epoch 18/25 180/180 [==============================] - 6s 32ms/step - loss: 0.6975 - accuracy: 0.7507 - val_loss: 2.2241 - val_accuracy: 0.5000
Epoch 19/25 180/180 [==============================] - 6s 32ms/step - loss: 0.6583 - accuracy: 0.7609 - val_loss: 2.1033 - val_accuracy: 0.4722
Epoch 20/25 180/180 [==============================] - 6s 32ms/step - loss: 0.6371 - accuracy: 0.7700 - val_loss: 2.2368 - val_accuracy: 0.4514
Epoch 21/25 180/180 [==============================] - 6s 32ms/step - loss: 0.6028 - accuracy: 0.7847 - val_loss: 2.3910 - val_accuracy: 0.4861
Epoch 22/25 180/180 [==============================] - 6s 32ms/step - loss: 0.5558 - accuracy: 0.8016 - val_loss: 2.1928 - val_accuracy: 0.5382
Epoch 23/25 180/180 [==============================] - 6s 32ms/step - loss: 0.5618 - accuracy: 0.8010 - val_loss: 2.4219 - val_accuracy: 0.4792
Epoch 24/25 180/180 [==============================] - 6s 33ms/step - loss: 0.5363 - accuracy: 0.8083 - val_loss: 2.3705 - val_accuracy: 0.5174
Epoch 25/25 180/180 [==============================] - 6s 33ms/step - loss: 0.4990 - accuracy: 0.8170 - val_loss: 1.9816 - val_accuracy: 0.5556
pd.DataFrame(history.history)[['loss', 'val_loss']].plot()
plt.title('Loss By Epoch')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.show()
pd.DataFrame(history.history)[['accuracy', 'val_accuracy']].plot()
plt.title('Accuracy By Epoch')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.show()
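# Reload the best checkpoint saved during training and score it on the validation set
test_model = keras.models.load_model('cnn_augmented.keras')
predicted = test_model.predict(val_dataset)
# argmax gives 0-based class indices; add 1 to match the 1-8 emotion codes
predicted = np.argmax(predicted, axis=1)+1
# Note: classification_report expects (y_true, y_pred). With the arguments in this order,
# the precision and recall columns are swapped and support counts predictions rather than
# true labels; per-class f1-score and overall accuracy are unaffected.
print(classification_report(predicted, val_labels))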
9/9 [==============================] - 0s 11ms/step
              precision    recall  f1-score   support

           1       0.53      0.38      0.44        24
           2       0.72      0.92      0.81        25
           3       0.48      0.36      0.41        59
           4       0.17      0.64      0.27        11
           5       0.56      0.39      0.46        38
           6       0.58      0.63      0.60        41
           7       0.39      0.44      0.41        41
           8       0.81      0.59      0.68        49

    accuracy                           0.51       288
   macro avg       0.53      0.54      0.51       288
weighted avg       0.56      0.51      0.52       288
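When reading a report like this one, the argument order matters: scikit-learn's classification_report takes the true labels first and the predictions second. The sketch below uses a small hand-written helper (precision_recall, a hypothetical name) and made-up toy labels, not the RAVDESS results, to show that swapping the two arguments swaps precision and recall for a class.

```python
import numpy as np

def precision_recall(y_true, y_pred, cls):
    """Precision and recall for a single class, computed from label arrays."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    true_pos = np.sum((y_true == cls) & (y_pred == cls))
    predicted_pos = np.sum(y_pred == cls)   # how often the class was predicted
    actual_pos = np.sum(y_true == cls)      # how often the class really occurs
    precision = true_pos / predicted_pos if predicted_pos else 0.0
    recall = true_pos / actual_pos if actual_pos else 0.0
    return float(precision), float(recall)

# Toy labels, invented for illustration only
y_true = [4, 4, 4, 4, 2, 2]
y_pred = [4, 2, 2, 2, 2, 2]

p, r = precision_recall(y_true, y_pred, cls=4)
p_swapped, r_swapped = precision_recall(y_pred, y_true, cls=4)
print("correct order:", p, r)
print("swapped order:", p_swapped, r_swapped)
```

In this toy case the class is rarely predicted but always right when it is, so precision is high and recall is low; with the arguments swapped the two numbers trade places.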
The code in this section never ran successfully because my computer ran out of RAM. Even after I decreased the batch size and moved to Google Colab, the Jupyter/Colab session crashed before the code could run. My hypothesis, however, is that this model would not have outperformed the previous model at emotion classification anyway. The OpenL3 weights loaded here were pre-trained on music (content_type="music") rather than on emotional speech, so I doubt the pre-trained features would have helped the model's predictions.
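A rough memory estimate suggests why RAM may have been the bottleneck. The sketch below assumes that each clip is held in memory as a float32 mel spectrogram with 128 mel bands (librosa's default) padded out to 48,000 frames, as this experiment's preprocessing did when it crashed, and that the training set has 1,152 clips (1,440 files minus the 288 validation files). Both the clip count and the dtype are assumptions, not measurements.

```python
import numpy as np

n_mels = 128        # assumed: librosa.feature.melspectrogram default number of mel bands
n_frames = 48_000   # frames per spectrogram after padding
n_clips = 1152      # assumed training set size (1440 files - 288 validation files)
bytes_per_value = np.dtype(np.float32).itemsize  # 4 bytes per float32 value

bytes_per_clip = n_mels * n_frames * bytes_per_value
total_gib = bytes_per_clip * n_clips / 2**30

print(f"{bytes_per_clip / 2**20:.1f} MiB per clip")
print(f"{total_gib:.1f} GiB for {n_clips} clips")
```

Under these assumptions the training data alone comes to roughly 26 GiB before TensorFlow makes its own copy, which is more than a typical laptop or a free Colab session provides.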
def load_and_process_audio(file_path, label):
    # OpenL3 works on 48 kHz audio, so load at 48 kHz; target_length is then exactly one second
    audio, sample_rate = librosa.load(file_path, sr=48000)
    target_length = 48000
    # Zero-pad short clips and truncate long ones so every clip has the same length
    if len(audio) < target_length:
        audio = np.pad(audio, (0, target_length - len(audio)))
    elif len(audio) > target_length:
        audio = audio[:target_length]
    # As I understand it, the OpenL3 model computes its own mel spectrogram from the raw
    # waveform, so return the waveform with shape (1, 48000) to match the Input layer
    # defined for the transfer model
    audio = np.expand_dims(audio, axis=0)
    return audio, label
train_data = []
for file, label in zip(train_paths, train_labels):
    spectrogram, label = load_and_process_audio(file, label)
    train_data.append(spectrogram)
val_data = []
for file, label in zip(val_paths, val_labels):
    spectrogram, label = load_and_process_audio(file, label)
    val_data.append(spectrogram)
sparse_train_labels = np.zeros((train_labels.shape[0],8))
for i, value in enumerate(np.arange(1,9)):
    sparse_train_labels[:, i] = (train_labels == value).astype(int)
sparse_val_labels = np.zeros((val_labels.shape[0],8))
for i, value in enumerate(np.arange(1,9)):
    sparse_val_labels[:, i] = (val_labels == value).astype(int)
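The loops above build one-hot label rows, one column per emotion code from 1 to 8. As a minimal sketch of the same idea in a more compact form, the hypothetical helper one_hot below indexes into an identity matrix; the example labels are made up for illustration.

```python
import numpy as np

def one_hot(labels, num_classes=8):
    """Turn 1-based integer labels into one-hot rows of length num_classes."""
    labels = np.asarray(labels)
    # Row k of the identity matrix is the one-hot vector for class k + 1
    return np.eye(num_classes, dtype=np.float32)[labels - 1]

example_labels = np.array([1, 3, 8])  # made-up emotion codes
encoded = one_hot(example_labels)
print(encoded.shape)
print(encoded)
```

Taking np.argmax along axis 1 and adding 1 recovers the original codes, which mirrors how the predictions are decoded elsewhere in this analysis.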
batch_size = 32
train_dataset = tf.data.Dataset.from_tensor_slices((train_data, sparse_train_labels))
train_dataset = train_dataset.shuffle(buffer_size=len(train_paths)).batch(batch_size)
val_dataset = tf.data.Dataset.from_tensor_slices((val_data, sparse_val_labels))
val_dataset = val_dataset.batch(batch_size)
import openl3
model = openl3.models.load_audio_embedding_model(input_repr="mel256", content_type="music", embedding_size=512)
for layer in model.layers[:-2]:
    layer.trainable = False
inputs = keras.Input(shape=(1, 48000), name='Input')
x = model(inputs)
outputs = layers.Dense(8, activation='softmax')(x)
model = keras.Model(inputs=inputs, outputs=outputs, name='transfer_model')
# Keras names this loss 'categorical_crossentropy'; it matches the one-hot labels built above
model.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['accuracy'])
model.summary()
callbacks = [keras.callbacks.ModelCheckpoint('transfer.keras', save_best_only=True, monitor='val_loss')]
# The datasets are already batched, so batch_size must not be passed to fit() as well
history = model.fit(train_dataset, validation_data=val_dataset, epochs=25, callbacks=callbacks)
The best model that I was able to produce for identifying emotions in spoken audio was the convolutional neural network trained on augmented data. This model reached 51% accuracy on the validation data set. Judging by F1 score, it was best at classifying "calm" (0.81) and "surprised" (0.68) clips, with "fearful" (0.60) next, and worst at classifying "sad" clips (0.27).
One thing that I did not mention at the beginning of this analysis is that, to increase the performance of all the models, I scaled the values of each spectrogram using the np.sqrt() function. The first time that I built these neural networks with plain spectrograms, the values were so faint that they barely showed up at all in the spectrogram images. Scaling seemed to increase the performance of every model by around 10%. However, because I am new to audio analysis, I think there is more I could do in the future to exaggerate the values and pull out even more features in the spectrograms, which would give the models more and better features to use for classification.
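One common alternative to square-root scaling is to convert power values to decibels, which compresses the range logarithmically (librosa offers power_to_db for this). The sketch below compares the two on synthetic power values using only NumPy; the function names and the input values are illustrative assumptions, and this is not the scaling used for the models above.

```python
import numpy as np

def sqrt_scale(power_spec):
    """Square-root compression, as used in this analysis."""
    return np.sqrt(power_spec)

def db_scale(power_spec, floor=1e-10):
    """Decibels relative to the loudest bin; floor avoids taking log of zero."""
    power_spec = np.maximum(power_spec, floor)
    return 10.0 * np.log10(power_spec / power_spec.max())

# Synthetic power values spanning six orders of magnitude (illustrative only)
power = np.array([[1.0, 1e-2, 1e-4, 1e-6]])

print(sqrt_scale(power))
print(db_scale(power))
```

With square-root scaling the quietest value here is still a thousand times smaller than the loudest, while on the decibel scale the values are evenly spaced, so faint detail stays visible next to loud regions.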
Data augmentation was a real highlight of this analysis. Adding some data augmentation techniques improved the performance of the model by around 9%. However, I knew nothing about audio analysis when I began, and I learned everything I know about audio transformations as I wrote the code to apply them. I think there are many other kinds of audio transformations that I could add to the augmentation phase to improve the model's performance even more, and I would like to explore these techniques further in the future.
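As a starting point for that exploration, here is a minimal sketch of three generic waveform augmentations written with NumPy only: additive noise, a random time shift, and a random gain change. The function names, parameter values, and the synthetic test tone are all assumptions for illustration, and they are not necessarily the transformations used earlier in this analysis.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

def add_noise(audio, noise_level=0.005):
    """Add low-level Gaussian noise to the waveform."""
    noise = rng.standard_normal(len(audio)).astype(audio.dtype)
    return audio + noise_level * noise

def time_shift(audio, max_shift):
    """Circularly shift the waveform by a random number of samples."""
    shift = int(rng.integers(-max_shift, max_shift + 1))
    return np.roll(audio, shift)

def random_gain(audio, low=0.8, high=1.2):
    """Scale the waveform by a random factor to vary its loudness."""
    return audio * audio.dtype.type(rng.uniform(low, high))

# One second of a synthetic 220 Hz tone at 48 kHz stands in for a real clip
t = np.linspace(0.0, 1.0, 48000, endpoint=False)
clip = np.sin(2 * np.pi * 220 * t).astype(np.float32)

augmented = random_gain(time_shift(add_noise(clip), max_shift=4800))
print(augmented.shape, augmented.dtype)
```

Each function returns an array with the same length and dtype as its input, so they can be chained in any order before the spectrogram is computed.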
In the end, this project taught me a lot about audio analysis and the kinds of models and techniques that can be used to analyze audio. I feel more empowered than ever to explore audio classification again in the future.